Segmenting Chinese in Unicode
Abstract
The automatic segmentation of Chinese text is an ongoing problem in information retrieval (IR) and computational linguistics: “words” in written Chinese are not delimited by spaces, so tokenizing (the first phase of many IR tasks) is considerably more difficult than for Western languages. This paper presents an overview of the segmentation problem, details previous research into its solution, and introduces Basis Technology’s Chinese Morphological Analyzer (CMA), a new, general-purpose hybrid segmentation system. The CMA is Unicode-based and can handle both Simplified and Traditional Chinese text from a variety of locales, including Mainland China, Taiwan, Hong Kong, and Singapore.
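To make the tokenization difficulty concrete, the following minimal sketch contrasts whitespace splitting with a simple dictionary-based forward maximum matching pass. This is a textbook baseline chosen for illustration, not the CMA's hybrid method, which the abstract does not describe in detail. The lexicon, the sample sentence, and the function name `forward_max_match` are all hypothetical.

```python
# Toy lexicon: a hypothetical stand-in for a real segmentation dictionary.
LEXICON = {"中文", "分词", "是", "一个", "问题"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def forward_max_match(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Greedy left-to-right segmentation: take the longest lexicon entry
    at each position, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


sentence = "中文分词是一个问题"  # "Chinese word segmentation is a problem"

# Whitespace splitting finds no boundaries: the sentence stays one token.
print(sentence.split())

# Dictionary matching recovers word-level tokens.
print(forward_max_match(sentence))
```

Because Python strings are sequences of Unicode code points, the same slicing logic applies to Simplified and Traditional text alike. A greedy matcher of this kind fails on ambiguous or out-of-lexicon strings, which is one reason segmentation remains a research problem.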
Similar resources
Supporting Chinese Character Variants in Hong Kong through Ideographic Variation Sequence
This paper will introduce an ongoing project in Hong Kong that makes use of the Ideographic Variation Sequence (IVS) and the associated Ideographic Variation Database (IVD) developed by the Unicode Consortium for character glyph registration. Hong Kong uses the traditional Chinese writing system similar to that of Taiwan and thus used the Big5 encoding for many years. But, Chinese characters us...
Building a Collation Element Table for a Large Chinese Character Set in YES
YES is a simplified stroke-based method for sorting Chinese characters. It is free from stroke counting and grouping, and thus much faster and more accurate than the traditional method. This paper presents a collation element table built in YES for a large joint Chinese character set covering (a) all 20,902 characters of Unicode CJK Unified Ideographs, (b) all 11,408 characters in the Complete ...
ALT-J/C - a prototype Japanese-to-Chinese automatic language translation system
This paper describes a prototype Japanese-to-Chinese automatic language translation system. ALT-J/C (Automatic Language Translator Japanese-to-Chinese) is a semantic transfer based system, which is based on ALT-J/E (a Japanese-to-English system), but written to cope with Unicode. It is also designed to cope with constructions specific to Chinese. This system has the potential to become a framew...
Unicode Chinese paleography: making the evolutionary leap from bone, bronze, silk, and paper, to electronic bits Dr. Richard
As more and more rare characters are encoded, Unicode provides better and better support for Chinese. In conjunction with CDL technology, texts of many historical periods can now be digitized with unparalleled accuracy. For even greater accuracy, a move beyond the encoding of modern-style CJK characters is required, and specialists from all over the world have begun to express interest in worki...
A Unicode Based Adaptive Segmentor
This paper presents a Unicode-based Chinese word segmentor. It can handle Chinese text in Simplified, Traditional, or mixed mode. The system uses a divide-and-conquer strategy to handle the recognition of personal names, numbers, time and numerical values, etc. in the preprocessing stage. The segmentor further uses tagging information to work on disambiguation. Adopting a modular design app...
Publication date: 2000